Initial riscv64 vector support (uses standard vector intrinsics for RVV 1.0. Presently VLEN=256 only.) #1037
Conversation
@mjosaarinen If you have
Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so it might provide some inspiration for a fully vectorized, hand-written back-end.
Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at it when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive); I could have just read the documentation. I'll set up this nix thing.
@mjosaarinen Sorry, we should have pointed that out earlier. With the
@mkannwischer @mjosaarinen The RV functional tests in the ci-cross shell don't seem to pass:

```shell
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
% EXEC_WRAPPER=qemu-riscv64 make run_func_512 -j32 OPT=1 AUTO=1
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:53: run_func_512] Error 1
```

Same for
Could you investigate?
The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.
It looks like some things are running fine, and some things have build problems. I suspect that the platform auto-detection code (in the C header files + makefiles) is to blame (does the failing no_opt mean "no optimization"?). It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?
Sure. If you have nix installed, then it's exactly what I posted above. You open the nix shell with

```shell
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
```

Building using the cross compiler:
And then running:
The base command line used by
What is confusing to me is that this happens regardless of whether
If you don't want to be bothered by auto-settings, we can for now set
It looks like the issue is solely due to
it'll fail with
Probably some configuration is missing in the invocation of
Hmm. I'm not sure if qemu is able to pull in the standard runtime and dynamic libraries that match the "vector ABI" (the calling convention changes somewhat with vector registers). I had to link the test programs with
Anyway, the code assumes the "vector 1.0" extension, 256-bit VLEN, and the standard "gc" stuff, so that looks correct. I left the Keccak instruction stuff out.
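For reference, a hedged sketch of a qemu-user invocation that pins the vector configuration and points qemu at a cross sysroot for the dynamic loader and libraries; the sysroot path is a placeholder, and these are generic qemu-riscv64 flags, not the project's actual CI settings:

```shell
# Run a dynamically linked RV64 test binary under qemu-user:
#   -cpu ... vlen=256   enable RVV 1.0 with VLEN pinned to 256 bits
#   -L <sysroot>        where qemu finds ld-linux-riscv64 and libc
# (the sysroot path below is a placeholder)
qemu-riscv64 \
  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 \
  -L /opt/riscv/sysroot \
  test/build/mlkem512/bin/test_mlkem512
```

Statically linking the test binaries sidesteps the sysroot question entirely, at the cost of larger binaries.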
I adjusted the CI. Most tests pass, but there are some issues in the ACVP tests and the monobuild examples. The ACVP one seems to be an issue in the test script (not expecting parameters in the exec-wrapper), while the monobuild failure seems to be an architecture confusion. I'll take a look.
@mjosaarinen Whoops, apologies for the accidental close.
We're close I think 👍
- The initial reduction can be replaced by the final normalization, as in the AArch64 and x86_64 backends.
- I think the conditional subtraction in the invNTT doesn't work; the test #1205 witnesses an overflow that ensues on a constant input of value -673. @mjosaarinen Can you remove the `csub` from the butterflies and instead add the necessary reductions where needed? It's probably the same as in the AArch64 invNTT, I suppose; see https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/intt.S#L246 and following.
- We should extend `src/sys.h` with the necessary system capability and guard the backend accordingly -- see e.g. how it's done for AVX2. Otherwise, one cannot target hosts with heterogeneous capabilities with a single build (e.g. RV64 CPU with/without vector extension, or with vector extension but different length). Probably we want a capability for "Vector Extension of VL=256". What's the correct name here? Once we have that, we should extend CI to test the library on a CPU a) without vector extension, b) with vector extension but a different VL -- in both cases, it should just fall back to the C code, of course. I can do that part, though.
@mjosaarinen Regarding e15004e: Is there anything fundamentally different about the instructions available in RV64 that prevents the 'normal' signed arithmetic from being used throughout the invNTT? I don't yet understand the need to work with unsigned coefficients.
Well, the invNTT takes in signed values but internally keeps them in the canonical range [0, q-1] throughout (I don't think FIPS 203 requires signed integers anywhere). This could be seen as a precaution, but it is not so bad for performance: since there is now an initial reduction, we can remove the final normalization, which is relatively expensive: b31855d
@mjosaarinen Thanks, yes, you are right that this is a valid strategy. On the other hand, we don't require normalized outputs and only normalise as part of the serialisation at the end (nowhere else). This works based on a moderate output bound of 8q. So we could just get rid of any final normalisation in the invNTT and it would be fine, assuming a selected set of reductions at a few coefficients in the middle of the invNTT -- but I would strongly suspect those to be much less expensive than a cadd+csub in every butterfly. Would you mind trying this? Again, not saying your approach is wrong, but we don't really gain anything from it because the rest of the code does not need stronger output bounds.
Well, I tried a little bit of that (and the code can pass the tests with some of those modifications), but I realized that I was just guessing again what that "selected set of reductions" could be. Let's leave that for later, ok? You know, the appropriate thing would be to work out the actual bounds systematically -- for every step and (proven-correct) arithmetic "gadget" -- and always have the proven output properties as input assumptions further down the line. Anyway, at present this is not a particularly optimized implementation -- there are some important things to address in relation to performance, like Keccak.
@mjosaarinen Hanno and I have done a detailed analysis of the various reduction strategies for the INTT. See the "fastntt3" branch, for example, which does the INTT (in C) using a "1,1,2,3" layer-merging strategy. It also turns out that a much better reduction strategy can be achieved if we can prove that the inputs to the INTT have magnitude bounded by INT16_MAX/2. We have proved this for the C code (on the "refine_bounds" branch) and the AArch64 code, but need to extend that proof to cover x86_64, which will require some modification of the basemul code. Perhaps we will return to this another time?
@mjosaarinen Let's do as you say. I think one can follow the Arm backend here, but if that's so -- I will try it out to be sure -- it can be a follow-up. Similarly, you mentioned you aim for a VL-agnostic version. Can this be a follow-up too?
Yes! For the record, this is exactly what the HOL-Light proofs for Arm are doing. You will see tight bounds in the HOL-Light specs that would be hard to verify on paper and impossible by eye. However, the frontend never relies on those bounds. The bounds required by the frontend can so far be audited by eye. This means we go as far into lazy reduction as can easily be annotated and followed in the code, but no further. That may leave some unnecessary reductions in the code, but at least for the invNTT the returns of more and more lazy reduction diminish quickly.
I think it can be a follow-up -- there is always the fallback to the C back-end. The thing about that is: while some parts can easily be made VLEN-agnostic using the "strip-mining" approach (discussed in the RISC-V manual), it seems to me that for an efficient NTT/INTT one would need separate versions for VLEN in { 128, 256, 512, 1024, perhaps others } due to the actual register allocation. By the way, in RVV the vector length VL is different from VLEN -- VL holds the dynamic vector length (in elements, like bytes or floats), while VLEN is the hardware size (in bits) of each of the 32 vector registers. Vectors (VL) do not have to be a power of two in size and can span up to 8 vector registers (LMUL). The maximum hardware register size allowed by the spec is VLEN=65536 bits. Due to LMUL one can technically do a lot in a single instruction (operate on 8*256 = 2048 bits even with VLEN=256), so that's another thing to explore. This code uses only LMUL=1 and LMUL=2.
@mjosaarinen Very interesting, thank you. I should say "VLEN" then, not "VL". What else needs doing for this PR, from your perspective? I will aim to go through everything again, probably later today or tomorrow. Do you have input on my note above on runtime detection of vector extension support?
0-diff rebase to remove merge commits.
…sics.) Signed-off-by: Markku-Juhani O. Saarinen <[email protected]> Signed-off-by: Matthias J. Kannwischer <[email protected]> Signed-off-by: Hanno Becker <[email protected]>
- Don't skip over 'opt' tests in CI when testing RV64 - Configure qemu to use 256-bit VL. Once we support other VLs, the invocation of qemu needs to be generalized accordingly. Signed-off-by: Hanno Becker <[email protected]>
Did some history cleanup, and re-added bounds assertions aiming to check that coefficients stay unsigned-canonical.
It no longer even compiles (this file is needed).
Well, I was thinking of running some benchmarks on real RVV hardware (SpacemiT X60) on Monday-Tuesday, and perhaps tweaking some things a bit based on the results.
From a practical viewpoint, it should be OK to have the existence of RVV (yes/no) as a build-time flag. After all, I understand that the Android and Ubuntu kernel & userland will be built for an RVA23-like processor profile and wouldn't work without Vector. Now, the physical vector length (VLEN) is another matter; it is intended to be dynamic and is supposed to be handled by programs directly. RVV code can do that with one instruction (it's designed to be fast) and without syscalls etc.
Summary:
RV64V support (RISC-V Vector extension 1.0, which is available on newer application-class silicon).
Steps:
If your pull request consists of multiple sequential changes, please describe them here:
Performed local tests:
- `lint` passing
- `tests all` passing
- `tests bench` passing
- `tests cbmc` passing

Do you expect this change to impact performance: Yes/No
Yes (RISC-V only)
If yes, please provide local benchmarking results.
Roughly 2.5x perf on silicon with vector hardware.